On the Feasibility of Automated Detection of Allusive Text Reuse
The detection of allusive text reuse is particularly challenging due to the
sparse evidence on which allusive references rely---commonly few or no shared
words. Arguably, lexical semantics can help, since
uncovering semantic relations between words has the potential to increase the
support underlying the allusion and alleviate the lexical sparsity. A further
obstacle is the lack of evaluation benchmark corpora, largely due to the highly
interpretative character of the annotation process. In the present paper, we
aim to elucidate the feasibility of automated allusion detection. We approach
the matter from an Information Retrieval perspective in which referencing texts
act as queries and referenced texts as relevant documents to be retrieved, and
estimate the difficulty of benchmark corpus compilation by a novel
inter-annotator agreement study on query segmentation. Furthermore, we
investigate to what extent the integration of lexical semantic information
derived from distributional models and ontologies can aid in retrieving cases of
allusive reuse. The results show that (i) despite low agreement scores, using
manual queries considerably improves retrieval performance with respect to a
windowing approach, and that (ii) retrieval performance can be moderately
boosted with distributional semantics.
Character-level Transformer-based Neural Machine Translation
Neural machine translation (NMT) is nowadays commonly applied at the subword
level, using byte-pair encoding. A promising alternative approach focuses on
character-level translation, which simplifies processing pipelines in NMT
considerably. This approach, however, must consider relatively longer
sequences, rendering the training process prohibitively expensive. In this
paper, we discuss a novel Transformer-based approach that we compare, in both
speed and quality, to the Transformer at subword and character levels, as
well as previously developed character-level models. We evaluate our models on
4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN. The proposed novel
architecture can be trained on a single GPU and is 34% faster than the
character-level Transformer; still, the obtained results are at least on par
with it. In addition, our proposed model outperforms the subword-level model in
FI-EN and shows close results in CS-EN. To stimulate further research in this
area and close the gap with subword-level NMT, we make all our code and models
publicly available.
Advances in Distant Diplomatics: A Stylometric Approach to Medieval Charters
The quantitative analysis of writing style (stylometry) is becoming an increasingly common research instrument in philology. When it comes to medieval texts, such a methodology might help us disentangle the multiple authorial strata that can often be discerned in them (issuer, dictator, scribe, etc.). To deliver a proof of concept in 'distant diplomatics,' we have turned to a corpus of twelfth-century Latin charters from the Cambrai episcopal chancery. We subjected this collection to an (unsupervised) stylometric modelling procedure, based on lexical frequency extraction and dimension reduction. In the absence of a sizable 'ground truth' for this material, we zoomed in on a specific case study, namely the oeuvre of the previously identified dictator-scribe known as 'RogF/JeanE.' Our results offer additional support for the attribution of a diplomatic oeuvre to this individual and even allow us to enlarge it with additional documents. Our analysis moreover yielded the serendipitous discovery of another, previously unnoticed, oeuvre, which we tentatively attribute to a scribe-dictator 'JeanB.' We conclude that large-scale stylometric analysis is a promising methodology for digital diplomatics. More effort, however, will have to be invested in establishing gold standards for this method to realize its full potential.
From exemplar to copy: the scribal appropriation of a Hadewijch manuscript computationally explored
This study is devoted to two of the oldest known manuscripts in which the
oeuvre of the medieval mystical author Hadewijch has been preserved: Brussels,
KBR, 2879-2880 (ms. A) and Brussels, KBR, 2877-2878 (ms. B). On the basis of
codicological and contextual arguments, it is assumed that the scribe who
produced B used A as an exemplar. While the similarities in both layout and
content between the two manuscripts are striking, the present article seeks to
identify the differences. After all, regardless of the intention to produce a
copy that closely follows the exemplar, subtle linguistic variation is
apparent. Divergences relate to spelling conventions, but also to the way in
which words are abbreviated (and the extent to which abbreviations occur). The
present study investigates the spelling profiles of the scribes who produced
mss. A and B in a computational way. In the first part of this study, we will
present both manuscripts in more detail, after which we will consider prior
research carried out on scribal profiling. The current study both builds and
expands on Kestemont (2015). Next, we outline the methodology used to analyse
and measure the degree of scribal appropriation that took place when ms. B was
copied from the exemplar ms. A. After this, we will discuss the results
obtained, focusing on the scribal variation that can be found both at the level
of individual words and n-grams. To this end, we use machine learning to
identify the most distinctive features that separate manuscript A from B.
Finally, we look at possible diachronic trends in the appropriation by B's
scribe of his exemplar. We argue that scribal takeovers in the exemplar impact
the practice of the copying scribe, while transitions to a different content
matter have little to no effect.
DHBeNeLux : incubator for digital humanities in Belgium, the Netherlands and Luxembourg
Digital Humanities BeNeLux is a grassroots initiative to foster knowledge networking and dissemination in digital humanities in Belgium, the Netherlands, and Luxembourg. This special issue highlights a selection of the work that was presented at the DHBenelux 2015 Conference, by way of an anthology of the digital humanities work currently being done in the Benelux area and beyond. The introduction describes why this grassroots initiative came about, how DHBenelux is currently supporting community building and knowledge exchange for digital humanities in the Benelux area, and how this is integrating regional digital humanities into the larger international digital humanities environment.
Collaborative authorship in the twelfth century: a stylometric study of Hildegard of Bingen and Guibert of Gembloux
Hildegard of Bingen (1098–1179) is one of the most influential female authors of the Middle Ages. From the point of view of computational stylistics, the oeuvre attributed to Hildegard is fascinating. Hildegard dictated her texts to secretaries in Latin, a language whose grammatical subtleties she had not fully mastered. She therefore allowed her scribes to correct her spelling and grammar. Hildegard’s last collaborator in particular, Guibert of Gembloux, seems to have considerably reworked her works during his secretaryship. Whereas her other scribes were only allowed to make superficial linguistic changes, Hildegard would have permitted Guibert to render her language stylistically more elegant. In this article, we focus on two shorter texts: the Visio ad Guibertum missa and Visio de sancto Martino, both of which Hildegard allegedly authored during Guibert’s secretaryship. We analyse a corpus containing the letter collections of Hildegard, Guibert and Bernard of Clairvaux using a number of common stylometric techniques. We discuss our results in the light of the Synergy Hypothesis, which suggests that texts resulting from collaboration can display a style markedly different from that of the collaborating authors. Finally, we demonstrate that Guibert must have reworked the disputed visionary texts allegedly authored by Hildegard to such an extent that style-oriented computational procedures attribute the texts to Guibert.